1 Introduction

This is an extensive Exploratory Data Analysis for the Porto Seguro’s Safe Driver Prediction competition, written in R using the tidyverse and ggplot2. We will visualise all the data features, examine their relation to the target variable, explore multi-parameter interactions, and perform feature engineering.

The aim of this challenge is to predict the probability that a driver will make an insurance claim, with the purpose of providing a fairer insurance cost on the basis of individual driving habits. It is sponsored by Porto Seguro, a major car and home insurance company in Brazil.

The data comes in the traditional Kaggle form of one training and test file each: ../input/train.csv & ../input/test.csv. Each row corresponds to a specific policy holder and the columns describe their features. The target variable is conveniently named target here and it indicates whether this policy holder made an insurance claim in the past.

Preparations

Load libraries

We load a range of libraries for data wrangling and visualisation, together with more specialised tools.

# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('ggthemes') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation

# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('rlang') # data manipulation

# specific visualisation
library('alluvial') # visualisation
library('ggfortify') # visualisation
library('ggrepel') # visualisation
library('ggridges') # visualisation
library('VIM') # NAs
library('plotly') # interactive
library('ggforce') # visualisation

# modelling
library('xgboost') # modelling
library('caret') # modelling
library('MLmetrics') # gini metric

Helper functions

We use the multiplot function, courtesy of the Cookbook for R, to create multi-panel plots. We also define a brief helper function to compute binomial confidence intervals.

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
# function to extract binomial confidence intervals
get_binCI <- function(x, n) as.list(setNames(binom.test(x, n)$conf.int, c("lwr", "upr")))
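As a quick illustration of the helper (with made-up toy counts, not the competition data), it returns the lower and upper bounds of an exact binomial 95% confidence interval as a named list:

```r
# helper as defined above, restated here for completeness
get_binCI <- function(x, n) as.list(setNames(binom.test(x, n)$conf.int, c("lwr", "upr")))

# toy example: 30 "claims" out of 1000 "policies"
ci <- get_binCI(30, 1000)
str(ci)  # list with elements lwr and upr bracketing the point estimate 30/1000 = 0.03
```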

Load data

We use data.table’s fread function to speed up reading in the data. In this challenge our files are not very large anyway, at about 110 MB for train and 165 MB for test. Missing values in the original data sets are encoded as -1 or -1.0, which we convert into proper NAs on import.

train <- as_tibble(fread('../input/train.csv', na.strings=c("-1","-1.0")))
test <- as_tibble(fread('../input/test.csv', na.strings=c("-1","-1.0")))
sample_submit <- as_tibble(fread('../input/sample_submission.csv'))
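To see what the na.strings argument does, here is a minimal sketch on an inline toy table (not the competition data): the -1 sentinels become proper NA values.

```r
library(data.table)

# toy csv string standing in for a file; "-1" / "-1.0" are the missing-value codes
dt <- fread("x,y\n-1,0.5\n3,-1.0\n", na.strings = c("-1", "-1.0"))
dt  # the sentinel values are read in as NA
```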

Overview: File structure and content

As a first step let’s have an overview of the data sets using the summary and glimpse tools.

Training data

summary(train)
##        id              target          ps_ind_01   ps_ind_02_cat 
##  Min.   :      7   Min.   :0.00000   Min.   :0.0   Min.   :1.00  
##  1st Qu.: 371992   1st Qu.:0.00000   1st Qu.:0.0   1st Qu.:1.00  
##  Median : 743548   Median :0.00000   Median :1.0   Median :1.00  
##  Mean   : 743804   Mean   :0.03645   Mean   :1.9   Mean   :1.36  
##  3rd Qu.:1115549   3rd Qu.:0.00000   3rd Qu.:3.0   3rd Qu.:2.00  
##  Max.   :1488027   Max.   :1.00000   Max.   :7.0   Max.   :4.00  
##                                                    NA's   :216   
##    ps_ind_03      ps_ind_04_cat   ps_ind_05_cat   ps_ind_06_bin   
##  Min.   : 0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median : 4.000   Median :0.000   Median :0.000   Median :0.0000  
##  Mean   : 4.423   Mean   :0.417   Mean   :0.419   Mean   :0.3937  
##  3rd Qu.: 6.000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:1.0000  
##  Max.   :11.000   Max.   :1.000   Max.   :6.000   Max.   :1.0000  
##                   NA's   :83      NA's   :5809                    
##  ps_ind_07_bin   ps_ind_08_bin    ps_ind_09_bin    ps_ind_10_bin     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.000000  
##  Mean   :0.257   Mean   :0.1639   Mean   :0.1853   Mean   :0.000373  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.000000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000000  
##                                                                      
##  ps_ind_11_bin      ps_ind_12_bin      ps_ind_13_bin      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.000000   Median :0.000000   Median :0.0000000  
##  Mean   :0.001692   Mean   :0.009439   Mean   :0.0009476  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.0000000  
##                                                           
##    ps_ind_14         ps_ind_15    ps_ind_16_bin    ps_ind_17_bin   
##  Min.   :0.00000   Min.   : 0.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.: 5.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median : 7.0   Median :1.0000   Median :0.0000  
##  Mean   :0.01245   Mean   : 7.3   Mean   :0.6608   Mean   :0.1211  
##  3rd Qu.:0.00000   3rd Qu.:10.0   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :4.00000   Max.   :13.0   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##  ps_ind_18_bin      ps_reg_01       ps_reg_02        ps_reg_03     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.06    
##  1st Qu.:0.0000   1st Qu.:0.400   1st Qu.:0.2000   1st Qu.:0.63    
##  Median :0.0000   Median :0.700   Median :0.3000   Median :0.80    
##  Mean   :0.1534   Mean   :0.611   Mean   :0.4392   Mean   :0.89    
##  3rd Qu.:0.0000   3rd Qu.:0.900   3rd Qu.:0.6000   3rd Qu.:1.08    
##  Max.   :1.0000   Max.   :0.900   Max.   :1.8000   Max.   :4.04    
##                                                    NA's   :107772  
##  ps_car_01_cat    ps_car_02_cat    ps_car_03_cat    ps_car_04_cat   
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0      Min.   :0.0000  
##  1st Qu.: 7.000   1st Qu.:1.0000   1st Qu.:0.0      1st Qu.:0.0000  
##  Median : 7.000   Median :1.0000   Median :1.0      Median :0.0000  
##  Mean   : 8.298   Mean   :0.8299   Mean   :0.6      Mean   :0.7252  
##  3rd Qu.:11.000   3rd Qu.:1.0000   3rd Qu.:1.0      3rd Qu.:0.0000  
##  Max.   :11.000   Max.   :1.0000   Max.   :1.0      Max.   :9.0000  
##  NA's   :107      NA's   :5        NA's   :411231                   
##  ps_car_05_cat    ps_car_06_cat    ps_car_07_cat   ps_car_08_cat   
##  Min.   :0.00     Min.   : 0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.00     1st Qu.: 1.000   1st Qu.:1.000   1st Qu.:1.0000  
##  Median :1.00     Median : 7.000   Median :1.000   Median :1.0000  
##  Mean   :0.53     Mean   : 6.555   Mean   :0.948   Mean   :0.8321  
##  3rd Qu.:1.00     3rd Qu.:11.000   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :1.00     Max.   :17.000   Max.   :1.000   Max.   :1.0000  
##  NA's   :266551                    NA's   :11489                   
##  ps_car_09_cat   ps_car_10_cat    ps_car_11_cat      ps_car_11    
##  Min.   :0.000   Min.   :0.0000   Min.   :  1.00   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.: 32.00   1st Qu.:2.000  
##  Median :2.000   Median :1.0000   Median : 65.00   Median :3.000  
##  Mean   :1.331   Mean   :0.9921   Mean   : 62.22   Mean   :2.346  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 93.00   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :2.0000   Max.   :104.00   Max.   :3.000  
##  NA's   :569                                       NA's   :5      
##    ps_car_12        ps_car_13        ps_car_14       ps_car_15    
##  Min.   :0.1000   Min.   :0.2506   Min.   :0.11    Min.   :0.000  
##  1st Qu.:0.3162   1st Qu.:0.6709   1st Qu.:0.35    1st Qu.:2.828  
##  Median :0.3742   Median :0.7658   Median :0.37    Median :3.317  
##  Mean   :0.3799   Mean   :0.8133   Mean   :0.37    Mean   :3.066  
##  3rd Qu.:0.4000   3rd Qu.:0.9062   3rd Qu.:0.40    3rd Qu.:3.606  
##  Max.   :1.2649   Max.   :3.7206   Max.   :0.64    Max.   :3.742  
##  NA's   :1                         NA's   :42620                  
##    ps_calc_01       ps_calc_02       ps_calc_03       ps_calc_04   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.2000   1st Qu.:0.2000   1st Qu.:0.2000   1st Qu.:2.000  
##  Median :0.5000   Median :0.4000   Median :0.5000   Median :2.000  
##  Mean   :0.4498   Mean   :0.4496   Mean   :0.4498   Mean   :2.372  
##  3rd Qu.:0.7000   3rd Qu.:0.7000   3rd Qu.:0.7000   3rd Qu.:3.000  
##  Max.   :0.9000   Max.   :0.9000   Max.   :0.9000   Max.   :5.000  
##                                                                    
##    ps_calc_05      ps_calc_06       ps_calc_07      ps_calc_08    
##  Min.   :0.000   Min.   : 0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:1.000   1st Qu.: 7.000   1st Qu.:2.000   1st Qu.: 8.000  
##  Median :2.000   Median : 8.000   Median :3.000   Median : 9.000  
##  Mean   :1.886   Mean   : 7.689   Mean   :3.006   Mean   : 9.226  
##  3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.:4.000   3rd Qu.:10.000  
##  Max.   :6.000   Max.   :10.000   Max.   :9.000   Max.   :12.000  
##                                                                   
##    ps_calc_09      ps_calc_10       ps_calc_11       ps_calc_12    
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.: 6.000   1st Qu.: 4.000   1st Qu.: 1.000  
##  Median :2.000   Median : 8.000   Median : 5.000   Median : 1.000  
##  Mean   :2.339   Mean   : 8.434   Mean   : 5.441   Mean   : 1.442  
##  3rd Qu.:3.000   3rd Qu.:10.000   3rd Qu.: 7.000   3rd Qu.: 2.000  
##  Max.   :7.000   Max.   :25.000   Max.   :19.000   Max.   :10.000  
##                                                                    
##    ps_calc_13       ps_calc_14     ps_calc_15_bin   ps_calc_16_bin  
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: 6.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000   Median : 7.000   Median :0.0000   Median :1.0000  
##  Mean   : 2.872   Mean   : 7.539   Mean   :0.1224   Mean   :0.6278  
##  3rd Qu.: 4.000   3rd Qu.: 9.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :13.000   Max.   :23.000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##  ps_calc_17_bin   ps_calc_18_bin   ps_calc_19_bin  ps_calc_20_bin  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.5542   Mean   :0.2872   Mean   :0.349   Mean   :0.1533  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
## 
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id             <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01      <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat  <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03      <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin  <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin  <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15      <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin  <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01      <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02      <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03      <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat  <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat  <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat  <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat  <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat  <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat  <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat  <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat  <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11      <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12      <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13      <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14      <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15      <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01     <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02     <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03     <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04     <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05     <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06     <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07     <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08     <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09     <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10     <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11     <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12     <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13     <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14     <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...

We find:

  • There are lots of features here. In total, our training data has 59 variables, including id and target. In some of them we already see a number of NAs.

  • The data description mentions that the names of the features indicate whether they are binary (bin) or categorical (cat) variables. Everything else is continuous or ordinal.

  • We have already been told by Adriano Moala that the names of the variables indicate certain properties: “ind” relates to the individual or driver, “reg” to the region, “car” to the car itself, and “calc” marks a calculated feature. Here we will refer to these properties as groups.

  • Note that there is a ps_car_11 as well as a ps_car_11_cat. This is the only place where the numbering within a group is not unique; it is probably a typo in the script that created the variable names.

  • The features are anonymised, because that worked so well in Mercedes. Just kidding. Mostly ;-)
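These naming conventions can be parsed programmatically. The following sketch (on a few example names) extracts group and type from a feature name with stringr:

```r
library(dplyr)
library(stringr)
library(tibble)

# a few example feature names
vars <- c("ps_ind_02_cat", "ps_reg_03", "ps_car_11_cat", "ps_calc_15_bin")

meta <- tibble(name = vars) %>%
  mutate(group = str_extract(name, "(?<=ps_)[a-z]+"),
         type  = case_when(str_detect(name, "_cat$") ~ "categorical",
                           str_detect(name, "_bin$") ~ "binary",
                           TRUE ~ "other"))
meta
```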

Test data:

summary(test)
##        id            ps_ind_01     ps_ind_02_cat     ps_ind_03     
##  Min.   :      0   Min.   :0.000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 372022   1st Qu.:0.000   1st Qu.:1.000   1st Qu.: 2.000  
##  Median : 744307   Median :1.000   Median :1.000   Median : 4.000  
##  Mean   : 744154   Mean   :1.902   Mean   :1.359   Mean   : 4.414  
##  3rd Qu.:1116309   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.: 6.000  
##  Max.   :1488026   Max.   :7.000   Max.   :4.000   Max.   :11.000  
##                                    NA's   :307                     
##  ps_ind_04_cat    ps_ind_05_cat   ps_ind_06_bin    ps_ind_07_bin   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :0.0000  
##  Mean   :0.4176   Mean   :0.422   Mean   :0.3932   Mean   :0.2572  
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :145      NA's   :8710                                     
##  ps_ind_08_bin    ps_ind_09_bin    ps_ind_10_bin      ps_ind_11_bin     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.0000   Median :0.0000   Median :0.000000   Median :0.000000  
##  Mean   :0.1637   Mean   :0.1859   Mean   :0.000373   Mean   :0.001595  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000000   Max.   :1.000000  
##                                                                         
##  ps_ind_12_bin      ps_ind_13_bin        ps_ind_14         ps_ind_15     
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00000   Min.   : 0.000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.: 5.000  
##  Median :0.000000   Median :0.000000   Median :0.00000   Median : 7.000  
##  Mean   :0.009376   Mean   :0.001039   Mean   :0.01238   Mean   : 7.297  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:10.000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :4.00000   Max.   :13.000  
##                                                                          
##  ps_ind_16_bin    ps_ind_17_bin    ps_ind_18_bin     ps_reg_01     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.4000  
##  Median :1.0000   Median :0.0000   Median :0.000   Median :0.7000  
##  Mean   :0.6606   Mean   :0.1204   Mean   :0.155   Mean   :0.6111  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.9000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :0.9000  
##                                                                    
##    ps_reg_02        ps_reg_03      ps_car_01_cat    ps_car_02_cat 
##  Min.   :0.0000   Min.   :0.06     Min.   : 0.000   Min.   :0.00  
##  1st Qu.:0.2000   1st Qu.:0.63     1st Qu.: 7.000   1st Qu.:1.00  
##  Median :0.3000   Median :0.80     Median : 7.000   Median :1.00  
##  Mean   :0.4399   Mean   :0.89     Mean   : 8.294   Mean   :0.83  
##  3rd Qu.:0.6000   3rd Qu.:1.09     3rd Qu.:11.000   3rd Qu.:1.00  
##  Max.   :1.8000   Max.   :4.42     Max.   :11.000   Max.   :1.00  
##                   NA's   :161684   NA's   :160      NA's   :5     
##  ps_car_03_cat    ps_car_04_cat    ps_car_05_cat    ps_car_06_cat   
##  Min.   :0.0      Min.   :0.0000   Min.   :0.0      Min.   : 0.000  
##  1st Qu.:0.0      1st Qu.:0.0000   1st Qu.:0.0      1st Qu.: 1.000  
##  Median :1.0      Median :0.0000   Median :1.0      Median : 7.000  
##  Mean   :0.6      Mean   :0.7258   Mean   :0.5      Mean   : 6.564  
##  3rd Qu.:1.0      3rd Qu.:0.0000   3rd Qu.:1.0      3rd Qu.:11.000  
##  Max.   :1.0      Max.   :9.0000   Max.   :1.0      Max.   :17.000  
##  NA's   :616911                    NA's   :400359                   
##  ps_car_07_cat   ps_car_08_cat    ps_car_09_cat  ps_car_10_cat   
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:0.00   1st Qu.:1.0000  
##  Median :1.000   Median :1.0000   Median :2.00   Median :1.0000  
##  Mean   :0.948   Mean   :0.8323   Mean   :1.33   Mean   :0.9921  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:2.00   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :4.00   Max.   :2.0000  
##  NA's   :17331                    NA's   :877                    
##  ps_car_11_cat      ps_car_11       ps_car_12        ps_car_13     
##  Min.   :  1.00   Min.   :0.000   Min.   :0.1414   Min.   :0.2758  
##  1st Qu.: 32.00   1st Qu.:2.000   1st Qu.:0.3162   1st Qu.:0.6712  
##  Median : 65.00   Median :3.000   Median :0.3742   Median :0.7661  
##  Mean   : 62.28   Mean   :2.347   Mean   :0.3800   Mean   :0.8136  
##  3rd Qu.: 94.00   3rd Qu.:3.000   3rd Qu.:0.4000   3rd Qu.:0.9061  
##  Max.   :104.00   Max.   :3.000   Max.   :1.2649   Max.   :4.0313  
##                   NA's   :1                                        
##    ps_car_14       ps_car_15       ps_calc_01       ps_calc_02    
##  Min.   :0.11    Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.35    1st Qu.:2.828   1st Qu.:0.2000   1st Qu.:0.2000  
##  Median :0.37    Median :3.317   Median :0.4000   Median :0.5000  
##  Mean   :0.37    Mean   :3.068   Mean   :0.4496   Mean   :0.4505  
##  3rd Qu.:0.40    3rd Qu.:3.606   3rd Qu.:0.7000   3rd Qu.:0.7000  
##  Max.   :0.64    Max.   :3.742   Max.   :0.9000   Max.   :0.9000  
##  NA's   :63805                                                    
##    ps_calc_03       ps_calc_04      ps_calc_05      ps_calc_06    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 1.000  
##  1st Qu.:0.2000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 7.000  
##  Median :0.4000   Median :2.000   Median :2.000   Median : 8.000  
##  Mean   :0.4501   Mean   :2.371   Mean   :1.885   Mean   : 7.688  
##  3rd Qu.:0.7000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :0.9000   Max.   :5.000   Max.   :6.000   Max.   :10.000  
##                                                                   
##    ps_calc_07     ps_calc_08       ps_calc_09      ps_calc_10    
##  Min.   :0.00   Min.   : 1.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:2.00   1st Qu.: 8.000   1st Qu.:1.000   1st Qu.: 6.000  
##  Median :3.00   Median : 9.000   Median :2.000   Median : 8.000  
##  Mean   :3.01   Mean   : 9.226   Mean   :2.339   Mean   : 8.443  
##  3rd Qu.:4.00   3rd Qu.:10.000   3rd Qu.:3.000   3rd Qu.:10.000  
##  Max.   :9.00   Max.   :12.000   Max.   :7.000   Max.   :25.000  
##                                                                  
##    ps_calc_11       ps_calc_12      ps_calc_13       ps_calc_14   
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 4.000   1st Qu.: 1.00   1st Qu.: 2.000   1st Qu.: 6.00  
##  Median : 5.000   Median : 1.00   Median : 3.000   Median : 7.00  
##  Mean   : 5.438   Mean   : 1.44   Mean   : 2.875   Mean   : 7.54  
##  3rd Qu.: 7.000   3rd Qu.: 2.00   3rd Qu.: 4.000   3rd Qu.: 9.00  
##  Max.   :20.000   Max.   :11.00   Max.   :15.000   Max.   :28.00  
##                                                                   
##  ps_calc_15_bin   ps_calc_16_bin   ps_calc_17_bin   ps_calc_18_bin  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.1237   Mean   :0.6278   Mean   :0.5547   Mean   :0.2878  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##  ps_calc_19_bin   ps_calc_20_bin  
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3493   Mean   :0.1524  
##  3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000  
## 
glimpse(test)
## Observations: 892,816
## Variables: 58
## $ id             <int> 0, 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 18,...
## $ ps_ind_01      <int> 0, 4, 5, 0, 5, 0, 0, 0, 0, 1, 0, 1, 1, 3, 0, 2,...
## $ ps_ind_02_cat  <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,...
## $ ps_ind_03      <int> 8, 5, 3, 6, 7, 6, 3, 0, 7, 6, 5, 4, 2, 3, 1, 2,...
## $ ps_ind_04_cat  <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_ind_05_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 6, 0,...
## $ ps_ind_06_bin  <int> 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ ps_ind_07_bin  <int> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ps_ind_08_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ ps_ind_09_bin  <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_10_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15      <int> 12, 5, 10, 4, 4, 10, 11, 7, 6, 7, 3, 9, 8, 0, 8...
## $ ps_ind_16_bin  <int> 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ ps_ind_17_bin  <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_reg_01      <dbl> 0.5, 0.9, 0.4, 0.1, 0.9, 0.9, 0.1, 0.9, 0.4, 0....
## $ ps_reg_02      <dbl> 0.3, 0.5, 0.0, 0.2, 0.4, 0.5, 0.1, 1.1, 0.0, 1....
## $ ps_reg_03      <dbl> 0.6103278, 0.7713624, 0.9161741, NA, 0.8177714,...
## $ ps_car_01_cat  <int> 7, 4, 11, 7, 11, 9, 6, 7, 11, 11, 11, 11, 11, 1...
## $ ps_car_02_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_car_03_cat  <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ ps_car_04_cat  <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2,...
## $ ps_car_05_cat  <int> NA, 0, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat  <int> 1, 11, 14, 1, 11, 11, 1, 11, 2, 4, 11, 7, 6, 1,...
## $ ps_car_07_cat  <int> 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1...
## $ ps_car_08_cat  <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,...
## $ ps_car_09_cat  <int> 2, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 2,...
## $ ps_car_10_cat  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat  <int> 65, 103, 29, 40, 101, 11, 10, 103, 104, 104, 10...
## $ ps_car_11      <int> 1, 1, 3, 2, 3, 2, 2, 3, 2, 2, 3, 3, 2, 1, 2, 0,...
## $ ps_car_12      <dbl> 0.3162278, 0.3162278, 0.4000000, 0.3741657, 0.3...
## $ ps_car_13      <dbl> 0.6695564, 0.6063200, 0.8962387, 0.6521104, 0.8...
## $ ps_car_14      <dbl> 0.3521363, 0.3583295, 0.3984972, 0.3814446, 0.3...
## $ ps_car_15      <dbl> 3.464102, 2.828427, 3.316625, 2.449490, 3.31662...
## $ ps_calc_01     <dbl> 0.1, 0.4, 0.6, 0.1, 0.9, 0.7, 0.9, 0.8, 0.9, 0....
## $ ps_calc_02     <dbl> 0.8, 0.5, 0.6, 0.5, 0.6, 0.9, 0.8, 0.9, 0.3, 0....
## $ ps_calc_03     <dbl> 0.6, 0.4, 0.6, 0.5, 0.8, 0.4, 0.8, 0.5, 0.0, 0....
## $ ps_calc_04     <int> 1, 3, 2, 2, 3, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 1,...
## $ ps_calc_05     <int> 1, 3, 3, 1, 4, 1, 1, 2, 2, 1, 2, 2, 1, 2, 3, 4,...
## $ ps_calc_06     <int> 6, 8, 7, 7, 7, 9, 7, 8, 9, 7, 5, 9, 8, 7, 7, 8,...
## $ ps_calc_07     <int> 3, 4, 4, 3, 1, 5, 3, 4, 7, 1, 5, 3, 0, 1, 3, 3,...
## $ ps_calc_08     <int> 6, 10, 6, 12, 10, 9, 9, 11, 9, 9, 7, 10, 9, 9, ...
## $ ps_calc_09     <int> 2, 2, 3, 1, 4, 4, 5, 2, 0, 1, 2, 1, 4, 1, 4, 3,...
## $ ps_calc_10     <int> 9, 7, 12, 13, 12, 12, 6, 8, 10, 11, 10, 4, 7, 1...
## $ ps_calc_11     <int> 1, 2, 4, 5, 4, 8, 2, 3, 5, 6, 2, 4, 3, 7, 6, 9,...
## $ ps_calc_12     <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 2, 4, 0, 3, 0,...
## $ ps_calc_13     <int> 1, 3, 2, 0, 0, 4, 4, 4, 4, 6, 1, 7, 0, 5, 5, 2,...
## $ ps_calc_14     <int> 12, 10, 4, 5, 4, 9, 6, 9, 6, 10, 8, 8, 12, 9, 6...
## $ ps_calc_15_bin <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ ps_calc_17_bin <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,...
## $ ps_calc_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_calc_19_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_20_bin <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

We find:

  • The test data contains the same features as the training data, minus the target column: 58 variables instead of 59, again with a number of NAs.

  • With 892,816 observations, the test set is about 50% larger than the training set.

Missing values

sum(is.na(train))
## [1] 846458
sum(is.na(test))
## [1] 1270295

There are on the order of one million missing values in each data set. This is not trivial.
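To see where those NAs are concentrated, one can count them per column. This is a sketch on a toy tibble; with the real data, replace toy by train:

```r
library(dplyr)
library(tidyr)

# toy frame standing in for `train`
toy <- tibble::tibble(a = c(1, NA, 3), b = c(NA, NA, 6), c = 1:3)

na_counts <- toy %>%
  summarise_all(~ sum(is.na(.))) %>%
  gather(feature, n_missing) %>%
  arrange(desc(n_missing))
na_counts
```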

Reformatting features

We will turn the categorical features into factors and the binary ones into logical values. For the target variable we choose a factor format.

train <- train %>%
  mutate_at(vars(ends_with("cat")), funs(factor)) %>%
  mutate_at(vars(ends_with("bin")), funs(as.logical)) %>%
  mutate(target = as.factor(target))
test <- test %>%
  mutate_at(vars(ends_with("cat")), funs(factor)) %>%
  mutate_at(vars(ends_with("bin")), funs(as.logical))
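As an aside, funs() was deprecated in later dplyr releases; with dplyr >= 1.0 the same conversion can be written with across(). A sketch on a toy tibble that follows the same column-suffix conventions:

```r
library(dplyr)

# Toy frame with the same suffix conventions as `train`
df <- tibble(ps_ind_02_cat = c(1, 2), ps_ind_06_bin = c(0, 1), target = c(0, 1))

# across() replaces the deprecated mutate_at()/funs() combination
df <- df %>%
  mutate(across(ends_with("cat"), factor),
         across(ends_with("bin"), as.logical),
         target = factor(target))
```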

Combining data frames

Here we combine the train and test data frames so that subsequent transformations can be applied to both in the same way.

combine <- bind_rows(train %>% mutate(dset = "train"), 
                     test %>% mutate(dset = "test",
                                     target = NA))
combine <- combine %>% mutate(dset = factor(dset))

Individual feature visualisations

We start our exploration with overview distribution plots for the various features. In order to make this visualisation more comprehensive, we will create layouts for the specific groups of features. For the sake of readability we divide each group into multiple parts.

These plots will be one of the pillars of our analysis. They might not be particularly exciting in themselves, but whenever we find an interesting effect in one of the variables we can come back here and examine their distribution. It’s always an advantage to start with a clear view of the parameter space.

Binary features part 1

p1 <- train %>%
  ggplot(aes(ps_ind_06_bin, fill = ps_ind_06_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_ind_07_bin, fill = ps_ind_07_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_ind_08_bin, fill = ps_ind_08_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_ind_09_bin, fill = ps_ind_09_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_ind_10_bin, fill = ps_ind_10_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_ind_11_bin, fill = ps_ind_11_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p7 <- train %>%
  ggplot(aes(ps_ind_12_bin, fill = ps_ind_12_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p8 <- train %>%
  ggplot(aes(ps_ind_13_bin, fill = ps_ind_13_bin)) +
  geom_bar() +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,6,7,8),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, layout=layout)
Fig. 1

We find that some of the binary features are very unbalanced, with “FALSE” accounting for the vast majority of cases. This is particularly true for the ps_ind sequence from “10” to “13”.
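As an aside, the eight near-identical chunks above can be generated in a loop. A sketch, assuming the multiplot() helper defined earlier is the common Cookbook-for-R version (which accepts a plotlist argument) and a ggplot2 version that supports the .data pronoun:

```r
library(ggplot2)

# Build the eight bar charts programmatically instead of copy-pasting
bin_vars <- sprintf("ps_ind_%02d_bin", 6:13)
plots <- lapply(bin_vars, function(v) {
  ggplot(train, aes(.data[[v]], fill = .data[[v]])) +
    geom_bar() +
    theme(legend.position = "none")
})
multiplot(plotlist = plots, layout = matrix(1:8, 2, 4, byrow = TRUE))
```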

Binary features part 2

p1 <- train %>%
  ggplot(aes(ps_ind_16_bin, fill = ps_ind_16_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_ind_17_bin, fill = ps_ind_17_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_ind_18_bin, fill = ps_ind_18_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_calc_15_bin, fill = ps_calc_15_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_calc_16_bin, fill = ps_calc_16_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_calc_17_bin, fill = ps_calc_17_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p7 <- train %>%
  ggplot(aes(ps_calc_18_bin, fill = ps_calc_18_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p8 <- train %>%
  ggplot(aes(ps_calc_19_bin, fill = ps_calc_19_bin)) +
  geom_bar() +
  theme(legend.position = "none")

p9 <- train %>%
  ggplot(aes(ps_calc_20_bin, fill = ps_calc_20_bin)) +
  geom_bar() +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,6,7,8,9,9),2,5,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, layout=layout)
Fig. 2

We find that this particular set of binary features is more balanced between “TRUE” and “FALSE”. For the three features ps_ind_16_bin, ps_calc_16_bin, and ps_calc_17_bin the “TRUE” values in fact dominate.

Categorical features part 1

Note the logarithmic y-axes:

p1 <- train %>%
  ggplot(aes(ps_ind_02_cat, fill = ps_ind_02_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_ind_04_cat, fill = ps_ind_04_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_ind_05_cat, fill = ps_ind_05_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_car_01_cat, fill = ps_car_01_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_car_02_cat, fill = ps_car_02_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_car_03_cat, fill = ps_car_03_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 3

We find that some categorical features have very few levels, down to 2 (+ NA) for three of them. Others have up to 11 levels, some of which clearly dominate the (logarithmic) plots.

Categorical features part 2

Note again the logarithmic y-axes:

p1 <- train %>%
  ggplot(aes(ps_car_04_cat, fill = ps_car_04_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_car_05_cat, fill = ps_car_05_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_car_06_cat, fill = ps_car_06_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_car_07_cat, fill = ps_car_07_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_car_08_cat, fill = ps_car_08_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_car_09_cat, fill = ps_car_09_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p7 <- train %>%
  ggplot(aes(ps_car_10_cat, fill = ps_car_10_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p8 <- train %>%
  ggplot(aes(ps_car_11_cat, fill = ps_car_11_cat)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

layout <- matrix(c(1,1,2,3,4,4,5,5,6,6,7,7,8,8,8,8),4,4,byrow=TRUE)
multiplot(p1, p2, p4, p3, p5, p6, p7, p8, layout=layout)
Fig. 4

We find that here, too, the number of levels is mostly low. Recall that the car features relate to the automobile itself. Feature “11” has many levels; it is the only number shared by a (supposedly) categorical and an integer feature. Maybe there has been a mix-up in naming these features?

Integer features part 1: “ind” and “car”

The integer features for the “ind” and “car” groups are best visualised in a categorical-style barplot, because their ranges are not very large. We are using log axes for some.

p1 <- train %>%
  mutate(ps_ind_01 = as.factor(ps_ind_01)) %>%
  ggplot(aes(ps_ind_01, fill = ps_ind_01)) +
  geom_bar() +
  theme(legend.position = "none")

p2 <- train %>%
  mutate(ps_ind_03 = as.factor(ps_ind_03)) %>%
  ggplot(aes(ps_ind_03, fill = ps_ind_03)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  mutate(ps_ind_14 = as.factor(ps_ind_14)) %>%
  ggplot(aes(ps_ind_14, fill = ps_ind_14)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none")

p4 <- train %>%
  mutate(ps_ind_15 = as.factor(ps_ind_15)) %>%
  ggplot(aes(ps_ind_15, fill = ps_ind_15)) +
  geom_bar() +
  theme(legend.position = "none")

p5 <- train %>%
  mutate(ps_car_11 = as.factor(ps_car_11)) %>%
  ggplot(aes(ps_car_11, fill = ps_car_11)) +
  geom_bar() +
  theme(legend.position = "none")


layout <- matrix(c(1,1,2,2,3,4,4,5),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 5

We find that again there are large differences in frequencies, in particular for ps_ind_14 and ps_car_11 where “0” and “3” are the dominating values, respectively.

Integer features part 2: “calc”

Whereas most of the “calc” integer features can still be visualised best using barplots, for three of them a histogram is a better choice:

p1 <- train %>%
  mutate(ps_calc_04 = as.factor(ps_calc_04)) %>%
  ggplot(aes(ps_calc_04, fill = ps_calc_04)) +
  geom_bar() +
  theme(legend.position = "none")

p2 <- train %>%
  mutate(ps_calc_05 = as.factor(ps_calc_05)) %>%
  ggplot(aes(ps_calc_05, fill = ps_calc_05)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  mutate(ps_calc_06 = as.factor(ps_calc_06)) %>%
  ggplot(aes(ps_calc_06, fill = ps_calc_06)) +
  geom_bar() +
  theme(legend.position = "none")

p4 <- train %>%
  mutate(ps_calc_07 = as.factor(ps_calc_07)) %>%
  ggplot(aes(ps_calc_07, fill = ps_calc_07)) +
  geom_bar() +
  theme(legend.position = "none")

p5 <- train %>%
  mutate(ps_calc_08 = as.factor(ps_calc_08)) %>%
  ggplot(aes(ps_calc_08, fill = ps_calc_08)) +
  geom_bar() +
  theme(legend.position = "none")

p6 <- train %>%
  mutate(ps_calc_09 = as.factor(ps_calc_09)) %>%
  ggplot(aes(ps_calc_09, fill = ps_calc_09)) +
  geom_bar() +
  theme(legend.position = "none")

p7 <- train %>%
  ggplot(aes(ps_calc_10, fill = ps_calc_10)) +
  geom_histogram(fill = "blue", binwidth = 1) +
  theme(legend.position = "none")

p8 <- train %>%
  ggplot(aes(ps_calc_11, fill = ps_calc_11)) +
  geom_histogram(fill = "blue", binwidth = 1) +
  theme(legend.position = "none")

p9 <- train %>%
  mutate(ps_calc_12 = as.factor(ps_calc_12)) %>%
  ggplot(aes(ps_calc_12, fill = ps_calc_12)) +
  geom_bar() +
  theme(legend.position = "none")

p10 <- train %>%
  mutate(ps_calc_13 = as.factor(ps_calc_13)) %>%
  ggplot(aes(ps_calc_13, fill = ps_calc_13)) +
  geom_bar() +
  theme(legend.position = "none")

p11 <- train %>%
  ggplot(aes(ps_calc_14, fill = ps_calc_14)) +
  geom_histogram(fill = "blue", binwidth = 1) +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,11),3,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, layout=layout)
Fig. 6

We find that the histogram features “10”, “11”, and “14” have close-to-normal-looking distributions, possibly with more pronounced tails towards larger values. The other features are not far from normal or log-normal shapes either and consequently span large ranges in frequency.

Float features part 1: “reg” and “calc”

For the floating point features we choose histograms to get a first impression of their distributions:

p1 <- train %>%
  ggplot(aes(ps_reg_01, fill = ps_reg_01)) +
  geom_histogram(fill = "dark green", binwidth = 0.1) +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_reg_02, fill = ps_reg_02)) +
  geom_histogram(fill = "dark green", binwidth = 0.1) +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_reg_03, fill = ps_reg_03)) +
  geom_histogram(fill = "dark green", binwidth = 0.1) +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_calc_01, fill = ps_calc_01)) +
  geom_histogram(fill = "blue", binwidth = 0.1) +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_calc_02, fill = ps_calc_02)) +
  geom_histogram(fill = "blue", binwidth = 0.1) +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_calc_03, fill = ps_calc_03)) +
  geom_histogram(fill = "blue", binwidth = 0.1) +
  theme(legend.position = "none")



layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 7

We find that while the (green) “reg” features show distributions that are clearly skewed toward a prominent peak, the (blue) “calc” features appear to be pretty uniformly distributed.

Float features part 2: “car”

The second group of these features is also visualised using histograms:

p1 <- train %>%
  ggplot(aes(ps_car_12, fill = ps_car_12)) +
  geom_histogram(fill = "red", binwidth = 0.05) +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_car_13, fill = ps_car_13)) +
  geom_histogram(fill = "red", binwidth = 0.1) +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_car_14, fill = ps_car_14)) +
  geom_histogram(fill = "red", binwidth = 0.01) +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_car_15, fill = ps_car_15)) +
  geom_histogram(fill = "red", binwidth = 0.1) +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 8

We find that the two features on the left show interesting sub-structure in their distributions, while ps_car_15 appears to take only a few distinct values up to ps_car_15 == 3, after which the gaps decrease notably.

Target variable

Finally, this is what it is all about: whether a claim has been filed (“1”) or not (“0”):

train %>%
  ggplot(aes(target, fill = target)) +
  geom_bar() +
  theme(legend.position = "none")
Fig. 9

We find:

  • The majority of cases have no filed claim:
train %>%
  group_by(target) %>%
  summarise(percentage = n()/nrow(train)*100)
## # A tibble: 2 x 2
##   target percentage
##   <fctr>      <dbl>
## 1      0  96.355248
## 2      1   3.644752

With less than 4% of policy holders filing a claim, the problem is heavily imbalanced.

More details on missing values

Before proceeding with our analysis we will briefly examine the different combinations of missing values for the individual features in the training data, using the aggr() function from the VIM package. In this plot, the frequency of NAs per feature is shown in the bar plot at the top. The main part of the plot shows all the combinations in which NAs occur in the same row for more than one feature. Thus, the more red rectangles a feature has, the more features it shares NAs with.

train %>%
  select(which(colMeans(is.na(.)) > 0)) %>%
  aggr(prop = FALSE, combined = TRUE, numbers = TRUE, bars = FALSE, cex.axis = 0.7)
Fig. 9

We find:

  • The features ps_car_03_cat and ps_car_05_cat have the largest number of NAs. They also share numerous instances where NAs occur in both of them for the same row.

  • There are features that share many NA rows with other features, for instance ps_reg_03. Others are exclusive, like ps_car_12, or almost exclusive, like ps_car_11 or ps_car_02_cat.

  • About 2.5% of all values are missing in each of the train and test data sets:

sum(is.na(train))/(nrow(train)*ncol(train))*100
## [1] 2.410359
sum(is.na(test))/(nrow(test)*ncol(test))*100
## [1] 2.453096

Claim rates for individual features

In order to determine the importance of the individual parameters we will study the distribution of their claim rates, i.e. the fraction of claims per level of a categorical variable, or how the distributions of claimants vs non-claimants compare. This analysis follows a similar structure to the one above.

Here we estimate error bars from binomial 95% confidence limits based on the count statistics of the claim vs no-claim cases. These will help us to decide whether potential differences are significant. The corresponding helper function get_binCI is defined in Section 2.2 above.
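get_binCI itself is not reproduced in this section; for reference, a minimal helper with the same purpose can be built on stats::binom.test. This is a sketch, not necessarily identical to the definition in Section 2.2:

```r
# Illustrative version of get_binCI: 95% binomial confidence limits for
# x "successes" (claims) out of n trials, returned as a two-element list
get_binCI <- function(x, n) {
  as.list(setNames(binom.test(x, n)$conf.int, c("lwr", "upr")))
}

ci <- get_binCI(36, 1000)  # e.g. 36 claims among 1000 policies
c(ci$lwr, ci$upr)          # interval bracketing the point estimate 0.036
```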

Binary features part 1

p1 <- train %>%
  group_by(ps_ind_06_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_06_bin, frac_claim, fill = ps_ind_06_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p2 <- train %>%
  group_by(ps_ind_07_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_07_bin, frac_claim, fill = ps_ind_07_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p3 <- train %>%
  group_by(ps_ind_08_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_08_bin, frac_claim, fill = ps_ind_08_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p4 <- train %>%
  group_by(ps_ind_09_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_09_bin, frac_claim, fill = ps_ind_09_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p5 <- train %>%
  group_by(ps_ind_10_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_10_bin, frac_claim, fill = ps_ind_10_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p6 <- train %>%
  group_by(ps_ind_11_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_11_bin, frac_claim, fill = ps_ind_11_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p7 <- train %>%
  group_by(ps_ind_12_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_12_bin, frac_claim, fill = ps_ind_12_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p8 <- train %>%
  group_by(ps_ind_13_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_13_bin, frac_claim, fill = ps_ind_13_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

layout <- matrix(c(1,2,3,4,5,6,7,8),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, layout=layout)
Fig. 10

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1

We find that the first batch of binary features shows significant differences in claim fraction, especially in the four top panels. There are differences in the bottom panels as well, but they are much less significant.
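The group_by/count/spread pipeline is identical across all of these chunks, so it too could be factored into a helper. A sketch on toy data (the function and column names here are illustrative, not taken from the kernel; the get_binCI error-bar columns could be added the same way):

```r
library(dplyr)
library(tidyr)

# Illustrative helper: claim fraction per level of one feature,
# mirroring the repeated pipeline above
claim_rate <- function(df, var) {
  df %>%
    count(level = .data[[var]], target) %>%
    spread(target, n, fill = 0) %>%
    mutate(frac_claim = `1` / (`1` + `0`) * 100)
}

# Toy usage with target coded as a factor "0"/"1", as in `train`
toy <- tibble(f = c("a", "a", "a", "b"), target = factor(c(0, 0, 1, 0)))
claim_rate(toy, "f")  # level a: ~33.3% claims, level b: 0%
```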

Binary features part 2

p1 <- train %>%
  group_by(ps_ind_16_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_16_bin, frac_claim, fill = ps_ind_16_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p2 <- train %>%
  group_by(ps_ind_17_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_17_bin, frac_claim, fill = ps_ind_17_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p3 <- train %>%
  group_by(ps_ind_18_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_18_bin, frac_claim, fill = ps_ind_18_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p4 <- train %>%
  group_by(ps_calc_15_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_15_bin, frac_claim, fill = ps_calc_15_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p5 <- train %>%
  group_by(ps_calc_16_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_16_bin, frac_claim, fill = ps_calc_16_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p6 <- train %>%
  group_by(ps_calc_17_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_17_bin, frac_claim, fill = ps_calc_17_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p7 <- train %>%
  group_by(ps_calc_18_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_18_bin, frac_claim, fill = ps_calc_18_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p8 <- train %>%
  group_by(ps_calc_19_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_19_bin, frac_claim, fill = ps_calc_19_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

p9 <- train %>%
  group_by(ps_calc_20_bin, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_20_bin, frac_claim, fill = ps_calc_20_bin)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(y = "Claims [%]")

layout <- matrix(c(1,2,3,4,5,6,7,8,9,9),2,5,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, layout=layout)
Fig. 11

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1

We find that the second batch of binary features shows much smaller, on average statistically insignificant, differences in claim rate. The only exceptions are ps_ind_16_bin and ps_ind_17_bin, where the differences are significant.

Categorical features part 1

Note that the factor levels are reordered by claims fraction, except for NA, which is always placed at the end:

p1 <- train %>%
  group_by(ps_ind_02_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_ind_02_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_02_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_02_cat", y = "Claims [%]")

p2 <- train %>%
  group_by(ps_ind_04_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_ind_04_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_04_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_04_cat", y = "Claims [%]")

p3 <- train %>%
  group_by(ps_ind_05_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_ind_05_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_05_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_05_cat", y = "Claims [%]")

p4 <- train %>%
  group_by(ps_car_01_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_01_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_01_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_01_cat", y = "Claims [%]")

p5 <- train %>%
  group_by(ps_car_02_cat, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_02_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_02_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_02_cat", y = "Claims [%]")

p6 <- train %>%
  group_by(ps_car_03_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_03_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_03_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_03_cat", y = "Claims [%]")

layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 12

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1

We find, interestingly enough, that there appears to be a correlation between the claim rate and whether a feature value exists at all, as evidenced by the high claim fractions among the NA entries for most features. Beyond this strong effect there might be a more subtle dependence on ps_ind_05_cat, especially for “2” vs “0”.

Categorical features part 2

p1 <- train %>%
  group_by(ps_car_04_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_04_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_04_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_04_cat", y = "Claims [%]")

p2 <- train %>%
  group_by(ps_car_05_cat, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_05_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_05_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_05_cat", y = "Claims [%]")

p3 <- train %>%
  group_by(ps_car_06_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_06_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_06_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_06_cat", y = "Claims [%]")

p4 <- train %>%
  group_by(ps_car_07_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_07_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_07_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_07_cat", y = "Claims [%]")

p5 <- train %>%
  group_by(ps_car_08_cat, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_08_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_08_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_08_cat", y = "Claims [%]")

p6 <- train %>%
  group_by(ps_car_09_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_09_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_09_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_09_cat", y = "Claims [%]")

p7 <- train %>%
  group_by(ps_car_10_cat, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_10_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_10_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_10_cat", y = "Claims [%]")

p8 <- train %>%
  group_by(ps_car_11_cat, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(reorder(ps_car_11_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_11_cat)) +
  geom_col() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
  theme(legend.position = "none") +
  labs(x = "ps_car_11_cat", y = "Claims [%]")

layout <- matrix(c(1,1,2,3,4,4,5,5,6,6,7,7,8,8,8,8),4,4,byrow=TRUE)
multiplot(p1, p2, p4, p3, p5, p6, p7, p8, layout=layout)
Fig. 12

# overwrite the plot objects to free memory
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1

We find that the second batch of categorical features also contains a few interesting ones, such as ps_car_06_cat, whereas others (like “10”) don’t seem to be related to the claims rate at all. We notice again that “11_cat” has far more levels than any other categorical feature.
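
The error bars in these plots come from a `get_binCI` helper that is defined earlier in the kernel and not shown in this excerpt. A minimal version consistent with how it is called here (two counts in, a list whose elements `[[1]]` and `[[2]]` are the lower and upper 95% bounds) could wrap the exact binomial test:

```r
# binomial proportion confidence interval; x = number of claims, n = total cases
# returns a list so that [[1]] is the lower and [[2]] the upper 95% bound
get_binCI <- function(x, n) {
  as.list(setNames(binom.test(x, n)$conf.int, c("lwr", "upr")))
}

ci <- get_binCI(13, 100)  # the point estimate 0.13 lies between ci[[1]] and ci[[2]]
```

This is a sketch, not necessarily the author's exact implementation; any binomial interval (e.g. Wilson) would serve the same purpose in the plots.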

Integer features part 1

For the integer features we will now use a scatter plot with (95% confidence) error bars to emphasise the natural ordering of the values:

p1 <- train %>%
  group_by(ps_ind_01, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_01, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_01", y = "Claims [%]")
  
p2 <- train %>%
  group_by(ps_ind_03, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_03, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_03", y = "Claims [%]")
  
p3 <- train %>%
  group_by(ps_ind_14, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_14, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_14", y = "Claims [%]")
  
p4 <- train %>%
  group_by(ps_ind_15, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_ind_15, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "ps_ind_15", y = "Claims [%]")
  
p5 <- train %>%
  filter(!is.na(ps_car_11)) %>%
  group_by(ps_car_11, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_car_11, frac_claim)) +
  geom_point(color = "red") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "red") +
  theme(legend.position = "none") +
  labs(x = "ps_car_11", y = "Claims [%]")

layout <- matrix(c(1,1,2,2,3,4,4,5),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 13

We find:

  • These variables show a much more significant signal, with claim rates varying over a range of 2-3 percentage points.

  • ps_ind_03 and ps_car_11 show notable drops in claim rate away from their maxima, which both occur at the value zero.

  • The claims fraction for ps_ind_01 has an increasing trend, whereas ps_ind_15 is almost monotonically decreasing.

  • Note that here we removed the NAs from the ps_car_11 feature, because there are only 5 such values (all with zero claims) and they would make the plot much harder to read.

Integer features part 2

For features with a larger range of values we switch to overlapping density plots. For the remaining features we keep the error-bar scatter plots, but remove a few values with low counts and very large error bars (this affects about half of the features):

p1 <- train %>%
  group_by(ps_calc_04, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_04, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_04", y = "Claims [%]")
  
p2 <- train %>%
  filter(ps_calc_05 < 6) %>%
  group_by(ps_calc_05, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_05, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_05", y = "Claims [%]")
  
p3 <- train %>%
  filter(ps_calc_06 > 2) %>%
  group_by(ps_calc_06, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_06, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_06", y = "Claims [%]")
  
p4 <- train %>%
  filter(ps_calc_07 < 8) %>%
  group_by(ps_calc_07, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_07, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_07", y = "Claims [%]")
  
p5 <- train %>%
  filter(ps_calc_08 > 2) %>%
  group_by(ps_calc_08, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_08, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_08", y = "Claims [%]")

p6 <- train %>%
  group_by(ps_calc_09, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_09, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_09", y = "Claims [%]")

p7 <- train %>%
  ggplot(aes(ps_calc_10, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.4) +
  theme(legend.position = "none")

p8 <- train %>%
  ggplot(aes(ps_calc_11, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.4) +
  theme(legend.position = "none")

p9 <- train %>%
  filter(ps_calc_12 < 9) %>%
  group_by(ps_calc_12, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_12, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_12", y = "Claims [%]")

p10 <- train %>%
  filter(ps_calc_13 < 12) %>%
  group_by(ps_calc_13, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ggplot(aes(ps_calc_13, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "ps_calc_13", y = "Claims [%]")

p11 <- train %>%
  ggplot(aes(ps_calc_14, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.4)

layout <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,11),3,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, layout=layout)
Fig. 14

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • There is no statistically significant signal in the scatter plots. All error bars overlap in each plot and the slight wiggles that we see might very well just be random variation.

  • The density plots show no difference either. Each plot uses two overlapping colours with alpha blending, as shown in the legend on the bottom right, but you only see the blended result because the overlap is practically perfect.
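
As an aside on the density plots: the `bw = 0.4` in the `geom_density` calls fixes the kernel bandwidth instead of letting each group estimate its own, which makes the two target curves directly comparable. Base R's `density` illustrates the same parameter on toy data (not a column of `train`):

```r
set.seed(42)
x <- rnorm(500)  # toy data for illustration only

d_fixed <- density(x, bw = 0.4)  # bandwidth pinned to 0.4 for both groups
d_auto  <- density(x)            # bandwidth from Silverman's rule of thumb ("nrd0")

c(fixed = d_fixed$bw, auto = d_auto$bw)
```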

Float features part 1

We examine the floating point features using overlapping density plots only:

p1 <- train %>%
  ggplot(aes(ps_reg_01, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_reg_02, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_reg_03, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_calc_01, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(ps_calc_02, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p6 <- train %>%
  ggplot(aes(ps_calc_03, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05)

layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 15

We find:

  • There are small but notable differences in the reg features, which are related to the region of the policy holder. Lower region values appear to show a lower number of claims on average.

  • The essentially uniform ps_calc_01 - ps_calc_03 features show almost perfect overlap. There might be some small differences but they could well be random.

Float features part 2

p1 <- train %>%
  ggplot(aes(ps_car_12, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_car_13, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_car_14, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_car_15, fill = target)) +
  geom_density(alpha = 0.5, bw = 0.1) +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 16

We find that the float car features also show small deviations towards higher numbers having more claims. While the effects might well be statistically significant their practical impact could be small.

Summary

In terms of the impact of the target on the individual features we clearly see strong differences in claim rates within and between the different groups. Some of the binary features show significant effects in the range of 1-2 percentage points, whereas others show no practical impact. Specifically, most “ind” features have a clear influence on claims whereas the “calc” binary features are neutral. The same appears to be true for the “calc” integer and floating point features, suggesting that the “calc” group in general is not of immediate usefulness for our prediction goal.

The strongest impact on claim rates is shown by the categorical and integer features; in particular the “ind” and “car” variables which can vary in the range of 2-3 percentage points. The effect on the floating point features is much more subtle, but might prove useful in getting the best prediction out of this data.

Multi-feature comparisons

After studying each feature individually we will now start to look at interactions between them. The anonymity of the features will make it more difficult to interpret these relations. However, they will still be useful for our prediction goal and for gaining a more detailed understanding of our data.

Correlation overview

We begin with a correlation matrix plot as a first comprehensive overview of our multi-parameter space. We will then take this overview plot as a starting point to investigate specific multi-feature comparisons in the following. Those examinations will likely result in more questions, which we can also examine (to a certain extent) in the same step.

What we will see here are the correlation coefficients for each pair of features. In simplest terms, a correlation coefficient shows whether two features are connected so that one changes with a predictable trend if the other changes. The closer the coefficient is to zero, the weaker the correlation; 1 and -1 are the ideal cases of perfect correlation and anti-correlation.
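
As a toy illustration of these regimes (unrelated to the actual features), base R's `cor` behaves as follows:

```r
set.seed(123)
x <- rnorm(1000)

r_pos  <- cor(x,  2 * x + 1)   # perfect linear relation: +1
r_neg  <- cor(x, -3 * x)       # perfect anti-correlation: -1
r_none <- cor(x, rnorm(1000))  # independent noise: close to 0
```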

Here we will only include those features that we found to be useful in the previous step. Thereby, our plot will be easier to read and more informative. Note that for the purpose of this plot we recode our binary and categorical features as integers. All rows with NAs are excluded.

train %>%
  select(-starts_with("ps_calc"), -ps_ind_10_bin, -ps_ind_11_bin, -ps_car_10_cat, -id) %>%
  mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
  mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
  mutate(target = as.integer(target)) %>%
  cor(use="complete.obs", method = "spearman") %>%
  corrplot(type="lower", tl.col = "black",  diag=FALSE)
Fig. 17

In this kind of plot we want to look for the bright, large circles which immediately show the strong correlations (size and shading depends on the absolute values of the coefficients; colour depends on direction). Anything that you would have to squint to see is usually not worth seeing.

We find:

  • Most features appear to be primarily correlated with others in their group. We can see this by studying the upper right region near where the diagonal would be and comparing it to the lower left area of the plot.

  • There is no obvious correlation with the target feature in the left-most column. This could be caused by the sparsity of the target == 1 values.

Here we plot only the moderately to highly correlated features by showing their correlation coefficients directly:

train %>%
  select(ps_ind_12_bin, ps_ind_14, ps_ind_16_bin, ps_ind_17_bin, ps_ind_18_bin, ps_reg_02,
         ps_reg_03, ps_car_12, ps_car_13, ps_car_14, ps_car_15, ps_car_02_cat, ps_car_04_cat) %>%
  mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
  mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
  cor(use="complete.obs", method = "spearman") %>%
  corrplot(type="lower", tl.col = "black",  diag=FALSE, method = "number")
Fig. 18

We find:

  • There is a very strong correlation between ps_ind_12_bin and ps_ind_14, which is an ordinal integer feature with 5 levels. Other correlations that exist are weaker but still notable.

  • The correlation between the “reg” and “car” features, respectively, shows how continuous variables are related. In particular ps_car_14, which showed only a small effect in the individual plots, might be interesting here.

  • The anti-correlations are generally not as strong, with ps_ind_16_bin and ps_ind_18_bin accounting for the strongest coefficient with -0.58.

Alluvial diagram for key features

For a different kind of visualisation of multi-feature interaction we can use an alluvial plot. Made available through the alluvial package, those plots are a kind of mix between a flow chart and a bar plot and they show how the target categories relate to various discrete features. Here we pick a subset of the more strongly correlated features we had identified in the previous plot:

allu_train <- train %>%
  filter(!is.na(ps_car_02_cat)) %>%
  filter(ps_car_04_cat %in% c("0","1","2","8","9")) %>%
  group_by(target, ps_ind_17_bin, ps_ind_18_bin, ps_car_02_cat, ps_car_04_cat) %>%
  count() %>%
  ungroup
  
alluvial(allu_train %>% select(-n),
         freq=allu_train$n, border=NA,
         col=ifelse(allu_train$target == 0, "red", "blue"),
         cex=0.75,
         hide = allu_train$n < 200,
         ordering = list(
           order(allu_train$target==1),
           NULL,
           NULL,
           NULL,
           NULL))
Fig. 19

In this plot, the vertical sizes of the blocks and the widths of the stripes (called “alluvia”) are proportional to the frequency. We decided to hide the alluvia with the lowest frequencies using the parameter of the same name. Within a category block you can re-order the vertical layering of the alluvia to emphasise certain aspects of your data set.
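
The count table that feeds `alluvial` can be sketched in base R alone; the data and column names here are invented for illustration:

```r
set.seed(1)
toy <- data.frame(target = rbinom(2000, 1, 0.04),
                  flag_a = rbinom(2000, 1, 0.3))

# one row per combination of levels, plus its frequency n
allu_toy <- aggregate(list(n = rep(1L, nrow(toy))), by = toy, FUN = sum)

# 'hide' expects a logical vector over these rows; TRUE stripes are suppressed,
# analogous to hide = allu_train$n < 200 above
hide_mask <- allu_toy$n < 20
```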

We see nicely how the small proportion of positive target values (on the upper left) contributes to the make-up of the different features. The dominating binary feature values can clearly be seen to carry an approximately proportional amount of claims. Feel free to fork this kernel to explore the composition of other features.

Exploring correlated features

Now we will have a closer look at those features we find to be correlated. Here our visualisation will depend on the feature group and how strong the relationship is.

Pairwise relationships

We begin with a layout of plots that examine the one-to-one relations for all pairings with an (absolute value) correlation coefficient above 0.5:

p1 <- train %>%
  ggplot(aes(ps_ind_14, fill = ps_ind_12_bin)) +
  geom_bar(position = "fill")

p2 <- train %>%
  ggplot(aes(ps_ind_16_bin, ps_ind_18_bin)) +
  geom_count(color = "orange")

p3 <- train %>%
  ggplot(aes(ps_ind_16_bin, ps_ind_17_bin)) +
  geom_count(color = "orange")

p4 <- train %>%
  ggplot(aes(ps_reg_02, ps_reg_03)) +
  geom_point() +
  geom_smooth(method = 'gam', color = "dark green")
  
p5 <- train %>%
  ggplot(aes(ps_car_12, ps_car_13)) +
  geom_point() +
  geom_smooth(method = 'gam', color = "red")
  
p6 <- train %>%
  ggplot(aes(ps_car_12, ps_car_14)) +
  geom_point() +
  geom_smooth(method = 'gam', color = "red")
  
layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p4, p2, p3, p5, p6, layout=layout)
Fig. 20

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • The barplot of ps_ind_14 demonstrates the strong correlation with ps_ind_12_bin, which stems mostly from the practically exclusive association of ps_ind_14 == 0 with ps_ind_12_bin == FALSE and ps_ind_14 == 1 with ps_ind_12_bin == TRUE. Even the ps_ind_14 values in between show a gradually increasing fraction of ps_ind_12_bin == TRUE.

  • We study the binary feature relations using count plots, where the size of the dots corresponds to the number of cases. For ps_ind_16_bin vs ps_ind_17_bin and ps_ind_18_bin we see that the predominant associations are ps_ind_16_bin == TRUE with ps_ind_17_bin == FALSE and ps_ind_18_bin == FALSE, respectively. There are no cases in these two relations where both features are TRUE.

  • For the floating point features we use a scatter plot together with a simple smoothing model to visualise their relationships. In case of ps_reg_02 we clearly see its distinct values. Here the linear relation is formally present but doesn’t look too convincing, considering the largely overlapping ranges of ps_reg_03.

  • In contrast, the two ps_car feature plots show a decent correlation that might even be underestimated by the coefficients due to the presence of obvious outliers.

Interactive multi-dimensional relations

Let’s add a little pinch of interactivity to our EDA recipe. Below we are using the R wrapper for the plotly toolkit which allows us to create plots we can examine from different angles and zoom stages. Here we use it to visualise the covariation between the three ps_car features (12, 13, 14) and its impact on the (colour-coded) target variable. Note, that we’re only using a subset of the data to keep the plot at a manageable size:

set.seed(4321)
train %>%
  filter(!is.na(ps_car_12) & !is.na(ps_car_13) & !is.na(ps_car_14)) %>%
  select(ps_car_12, ps_car_13, ps_car_14, target) %>%
  sample_n(5e4) %>%
  plot_ly(x = ~ps_car_12, y = ~ps_car_13, z = ~ps_car_14,
          color = ~target, colors = c('#BF382A', '#0C4B8E'),
          text = ~paste('ps_car_12:', ps_car_12,
                        '<br>ps_car_13:', ps_car_13,
                        '<br>ps_car_14:', ps_car_14,
                        '<br>Target:', target)
          ) %>%
  add_markers() %>%
  layout(title = "'Car' group correlations and target impact",
         scene = list(xaxis = list(title = 'ps_car_12'),
                     yaxis = list(title = 'ps_car_13'),
                     zaxis = list(title = 'ps_car_14')))

Fig. 21

We find that the correlations can clearly be seen and lead to a characteristic plane in the 3D parameter space. The target == 1 cases appear to be homogeneously dotted throughout this plane. Have a go at exploring the data through this plot :-)

Multi-parameter grid

Now we are putting all the correlated pieces together in a comprehensive, facetted overview plot for the main parameters in this section. These parameters don’t show an obvious connection in the correlation plot, but it is worth looking for more subtle interactions here.

We are plotting the distribution of ps_car_14 (which is correlated with ps_car_12/13) for the 5 ps_ind_14 classes (correlated with ps_ind_12_bin) using ridgeline plots through ggridges. Ridgeline plots allow for a quick comparison of overlapping (density) curves.

Then we separate those overlapping curves by ps_ind_16_bin (correlated with ps_ind_17_bin and ps_ind_18_bin) horizontally, and finally by target vertically. This is the result:

train %>%
  ggplot(aes(ps_car_14, reorder(ps_ind_14, - ps_car_14, FUN = median), fill = as.factor(ps_ind_14))) +
  geom_density_ridges() +
  labs(y = "ps_ind_14") +
  theme(legend.position = "none", plot.title = element_text(size=11)) +
  facet_grid(ps_ind_16_bin ~ target) +
  ggtitle("Target (left/right) vs ps_ind_16_bin (top/bottom) for ps_car_14 (x) vs ps_ind_14 (y, col)")
Fig. 22

We find:

  • Reading from left to right and top to bottom the number of populated levels of ps_ind_14 (i.e. number of colours) steadily decreases. That means that target == 1 & ps_ind_16_bin == TRUE only contain ps_ind_14 <= 1. In general, target == 1 (i.e. claims) contains fewer occupied levels of ps_ind_14 than target == 0 (no claims).

  • The distributions of ps_car_14 show notable differences for different ps_ind_14 levels throughout the grid. In particular ps_ind_14 == 0 is associated with significantly larger ps_car_14 values.

  • The ps_ind_14 levels from 1-3 are much more similar but also show subtle differences particularly towards larger values of ps_car_14. We have already seen above that level 4 is only sparsely populated.

The impact of uncorrelated high-signal features

The previous section has revealed a notable amount of variation in the continuous feature ps_car_14 from interactions with correlated categorical features. This was something that we did not see based on the distribution of ps_car_14 alone. Therefore, it is reasonable to ask whether other features might interact in a similar way. That analysis is the subject of this section.

We start by relating the same ps_car_14 to those integer features that we found to have a strong impact on the claim rates. Namely, those are the ps_ind features “01”, “03”, and “15” as well as ps_car_11:

p1 <- train %>%
  ggplot(aes(ps_car_14, reorder(ps_car_11, - ps_car_14, FUN = median), fill = as.factor(ps_car_11))) +
  geom_density_ridges() +
  labs(y = "ps_car_11") +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(ps_car_14, reorder(ps_ind_01, - ps_car_14, FUN = median), fill = as.factor(ps_ind_01))) +
  geom_density_ridges() +
  labs(y = "ps_ind_01") +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(ps_car_14, reorder(ps_ind_03, - ps_car_14, FUN = median), fill = as.factor(ps_ind_03))) +
  geom_density_ridges() +
  labs(y = "ps_ind_03") +
  theme(legend.position = "none")

p4 <- train %>%
  ggplot(aes(ps_car_14, reorder(ps_ind_15, - ps_car_14, FUN = median), fill = as.factor(ps_ind_15))) +
  geom_density_ridges() +
  labs(y = "ps_ind_15") +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 23

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • Our intuition was useful with regard to ps_car_11 and also ps_ind_01. The car feature leads to strong differences in the ps_car_14 distribution - essentially dissecting the distribution into distinct peaks. The signature of the NAs is interesting too. In case of the ind feature there is a smaller amount of variation in the size of the multiple peaks.

  • For ps_ind “03” and “15”, however, the differences are very subtle and practically non-existent, respectively. Despite their strong impact on the claim rates, these features appear to leave the ps_car_14 distribution virtually unchanged.

Feature engineering

Once we have sufficiently explored the connections within the existing features our next step will be to use these insights to build new features, derived from interactions or other characteristics of the data. As in some of my other kernels, the new features will be defined in the following single code block and then studied in the sub-sections below. This gives us the advantage of having everything in one place instead of dotted across this section. This code block will grow as we engineer new features.

There is of course one problem here: How to do this engineering from anonymised features? Not knowing what our variables represent makes it more difficult to interpret their connections. Fortunately, we still have a couple of tricks up our sleeve. Let’s see how useful they are :-)

We work on the combined data frame to make sure that our train and test data are treated consistently.

# number of NAs
nano <- combine %>%
  is.na() %>%
  rowSums() %>%
  as.integer()

# Sum up "ind" binary columns
bin_ind <- combine %>% select(ends_with("bin")) %>%
  select(starts_with("ps_ind")) %>%
  rowSums() %>%
  as.integer()

# Sum up "calc" binary columns
bin_calc <- combine %>% select(ends_with("bin")) %>%
  select(starts_with("ps_calc")) %>%
  rowSums() %>%
  as.integer()

# "ind" binary column differences per row
# (uses "rep" to create a combine-size data frame from the
# single reference row that we then subtract from all rows)
bins <- combine %>%
  select(ends_with("bin")) %>%
  select(starts_with("ps_ind")) %>%
  mutate_all(funs(as.integer))

ref_bin <- bins %>% summarise_all(median, na.rm = TRUE)
ref_bin <- ref_bin[rep(1,nrow(combine)),]

diff_ind <- rowSums(abs(bins - ref_bin))

# "calc" binary column differences per row
bins <- combine %>%
  select(ends_with("bin")) %>%
  select(starts_with("ps_calc")) %>%
  mutate_all(funs(as.integer))

ref_bin <- bins %>% summarise_all(median, na.rm = TRUE)
ref_bin <- ref_bin[rep(1,nrow(combine)),]

diff_calc <- rowSums(abs(bins - ref_bin))


# Apply changes to combine frame
combine <- combine %>%
  mutate(nano = nano,
         bin_ind = bin_ind,
         bin_calc = bin_calc,
         diff_ind = diff_ind,
         diff_calc = diff_calc)

# Split into train vs test
train <- combine %>%
  filter(dset == "train")
test <- combine %>%
  filter(dset == "test")
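
The `rep` replication trick behind `diff_ind` and `diff_calc` is easiest to see on a tiny toy frame:

```r
bins <- data.frame(a = c(0L, 1L, 1L, 1L),
                   b = c(0L, 0L, 0L, 1L))

ref <- data.frame(lapply(bins, median))  # 1-row frame of column medians (a = 1, b = 0)
ref <- ref[rep(1, nrow(bins)), ]         # replicated to the full height of bins

# per-row count of flags differing from the "typical" (median) row
d <- unname(rowSums(abs(bins - ref)))    # 1, 0, 0, 1
```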

Number of NAs per ID

We start with the number of NAs for each policy holder ID. Sometimes it is just as important to take into account how much we don’t know about a policy holder as it is to work with the information we have. Below, we examine the distribution of NAs per ID throughout all features (i.e. per row for all columns) and the corresponding claim rates.

Note the logarithmic y-axis in the histogram. For the claim rates, we make use of a facet_zoom, via the ggforce package, to examine the observations with few NAs in greater detail. The zoomed area (on the left) is highlighted in the full plot (on the right) in a darker shade of grey:

p1 <- train %>%
  ggplot(aes(nano, fill = as.factor(nano))) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(x = "Number of NAs")

p2 <- train %>%
  group_by(nano, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ungroup() %>%
  ggplot(aes(nano, frac_claim)) +
  geom_point(color = "red") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "Number of NAs", y = "Claims [%]") +
  facet_zoom(x = nano < 5, y = frac_claim < 10)

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 24

p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • The number of NAs is predominantly low with most IDs having <= 2 NA entries (log axis). Interestingly, 2 NAs, not zero, is the most common number. Beyond that, the numbers drop quickly and there are only a few cases with more than 5 NAs. It is interesting to note that no ID has exactly 5 NAs, which might reflect the way our features are related. There is a small second peak at 7 NAs, which might also reveal some underlying structure in the data.

  • The claim rates show a very interesting pattern with a steady decline from 0 to 4 within the percentage range of 2-5%. For 6 and 7 NAs the rates are very high in comparison (means of 60% and 40%) and again there is a declining trend. For 8 NAs there are not enough cases to make a statistically significant statement.

  • The differences between 0-4 and 6-8 NAs suggest that we might have two distinct populations here, reflecting differences in the policy holders’ characteristics. It will be well worth coming back to these distinctions during the course of our further analysis.

Sum of binary features

Next, we want to examine what happens if we sum up all the binary feature values of each group for the different IDs. There are two groups that have binary features: “ind” (the individual policy holder characteristics) and “calc” (calculated features). Here we show their distributions and claim rates:

p1 <- train %>%
  ggplot(aes(bin_ind, fill = as.factor(bin_ind))) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(x = "Sum of binary 'ind' features")

p2 <- train %>%
  filter(bin_ind < 6) %>%
  group_by(bin_ind, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ungroup() %>%
  ggplot(aes(bin_ind, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "Sum of binary 'ind' features", y = "Claims [%]")

p3 <- train %>%
  ggplot(aes(bin_calc, fill = as.factor(bin_calc))) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(x = "Sum of binary 'calc' features")

p4 <- train %>%
  group_by(bin_calc, target) %>%
  count() %>%
  spread(target, n) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ungroup() %>%
  ggplot(aes(bin_calc, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "Sum of binary 'calc' features", y = "Claims [%]")

layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 25


p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • Adding up the “ind” binary features results in a distribution which peaks at 2 and has a range of 1-6. There is no “0”, which means that at least one of the binary flags is always set. At the beginning of this kernel we saw that there are a total of 11 “ps_ind_??_bin” features but we only have a maximum of six set to “TRUE/1” at the same time. This is consistent with “FALSE/0” being the dominating value for most of these features.

  • When plotting the claim rates for the sum of binary “ind” features (bottom left), we see from the count plot (top left) that there are only a few instances with 6 binary flags set. These would produce a large statistical error bar that would dominate the plot and make it harder to read. Therefore, we remove binary sum == 6 from the bottom left plot. For the remaining sums there is a trend of increasing claim rate with increasing sum; but only “3” vs “1” or “2” is statistically significant. Still, this could be a useful new feature.

  • The “calc” binary features, on the other hand, display no significant variation in their claim rates, providing further evidence that the “calc” group might not be very useful for our prediction goal. Here the sum distribution also peaks at 2, but ranges from a minimum of 0 to a maximum of 6 for the 6 “calc” features in total. Toward larger sums the (log-scaled) distribution declines less sharply than for the “ind” features.
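For reference, here is a minimal sketch of how such row-wise binary sums can be constructed (a toy example with the competition's naming pattern; the actual bin_ind and bin_calc features were defined earlier in the kernel):

```r
library(dplyr)

# Toy data mimicking the competition's binary feature names
toy <- tibble::tibble(ps_ind_06_bin  = c(0, 1, 1),
                      ps_ind_07_bin  = c(1, 1, 0),
                      ps_calc_15_bin = c(0, 0, 1))

# Sum up the binary flags of each group for every row
toy <- toy %>%
  mutate(bin_ind  = rowSums(select(., matches("^ps_ind.*bin$"))),
         bin_calc = rowSums(select(., matches("^ps_calc.*bin$"))))
```

The regular expressions select each group's binary columns by their common prefix and suffix, so the same pattern carries over directly to the full feature set.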

Difference measure for binary features

After adding up the different binary values, our next feature will be a simple measure of how different certain binary observations are. We define a reference row, which corresponds to the median value of each binary feature. This definition was changed from the original one following constructive criticism by David J. Slate and Oscar Takeshita in the comments below. They were rightfully pointing out that this feature is not independent of the reference row.

We then subtract the binary values of this reference row from the set of binary values for each row. This flags all the bins that are different (i.e. are “0” where the reference row is “1”, and vice versa). To properly count the differences we sum up the absolute values of the resulting data frame, so that a “-1” contributes “1” to our difference measure. Without the absolute values, a “-1” and a “1” would cancel out, even though both indicate a difference in the binary features; two such differences should add up to “2”. This is the main difference between this feature and the binary sum feature above.
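On a toy example, this difference measure can be computed like this (a sketch, assuming a data frame of purely binary columns):

```r
# Toy binary data; each row will be compared to the median reference row
dat <- data.frame(a_bin = c(0, 1, 1, 0, 0),
                  b_bin = c(1, 1, 0, 1, 0),
                  c_bin = c(0, 0, 1, 1, 1))

m   <- as.matrix(dat)
ref <- apply(m, 2, median)                 # reference row: per-column medians

# Count differing bins: abs() turns every mismatch (+1 or -1) into a 1
diff_measure <- rowSums(abs(sweep(m, 2, ref)))
diff_measure  # 1 2 2 0 1
```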

We do this for the “ind” and the “calc” binary features separately. Below are the resulting plots for the distribution and the claim rates for each distance measure. Note the square-root scaling in the y-axis of the bar plot:

p1 <- train %>%
  ggplot(aes(as.factor(diff_ind), fill = as.factor(diff_ind))) +
  geom_bar() +
  scale_y_sqrt() +
  labs(fill = "diff_ind", x = "Absolute difference of binary 'ind' features")

p2 <- train %>%
  ggplot(aes(as.factor(diff_calc), fill = as.factor(diff_calc))) +
  geom_bar() +
  scale_y_sqrt() +
  labs(fill = "diff_calc", x = "Absolute difference of binary 'calc' features")

p3 <- train %>%
  filter(diff_ind < 7) %>%
  group_by(diff_ind, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ungroup() %>%
  ggplot(aes(diff_ind, frac_claim)) +
  geom_point(color = "orange") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
  theme(legend.position = "none") +
  labs(x = "Absolute difference of binary 'ind' features", y = "Claims [%]")

p4 <- train %>%
  filter(diff_calc < 6) %>%
  group_by(diff_calc, target) %>%
  count() %>%
  spread(target, n, fill = 0) %>%
  mutate(frac_claim = `1`/(`1`+`0`)*100,
         lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
         upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
         ) %>%
  ungroup() %>%
  ggplot(aes(diff_calc, frac_claim)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
  theme(legend.position = "none") +
  labs(x = "Absolute difference of binary 'calc' features", y = "Claims [%]")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 26


p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find:

  • The differences of “ind” binary values are dominated by values 1 and 3. Interestingly, there are no values of zero, indicating that our median reference row does not actually occur in the data. We’re seeing a pattern in the claim rates suggesting that the claim fraction rises the further away we step from our reference row values. However, only the differences between values 1 & 2 vs 3 & 4 appear to be statistically significant.

  • In the “calc” binary features we see that we still haven’t managed to find a useful signal. There might be a tentative decrease in claim rate between the values of 2 vs 3, but everything is consistent within the uncertainties. Therefore, it appears that there is very little predictive power in the “calc” group.

Modelling preparation

Before we start building our prediction model we want to run a few sanity checks and get our data in shape for the model fitting. These preparations are the subject of this section.

Train vs test comparison

One thing we want to make sure is that the train and test data do actually cover similar parts of the parameter space. There is very little use in having a feature with a strong target correlation in our training data if those crucial feature levels are absent in the test data.

First, we’ll prepare the data and define a new helper function. The following code block does most of the (relatively) heavy lifting in this subsection. We will outline its various steps as we explore the results.

bin_col <- combine %>% select(ends_with("bin")) %>% colnames()
cat_col <- combine %>% select(ends_with("cat")) %>% colnames()

train_frac_bin <- NULL
for (i in bin_col){
  foo <- combine %>%
    group_by(!!sym(i), dset) %>%
    count() %>%
    spread(dset, n, fill = 0) %>%
    mutate(frac_train = train/(train+test)*100,
           lwr = get_binCI(train,(train+test))[[1]]*100,
           upr = get_binCI(train,(train+test))[[2]]*100
           )
  train_frac_bin <- tibble(name = i,
                           value = c(FALSE, TRUE),
                           tfrac = c(foo$frac_train),
                           lwr = c(foo$lwr),
                           upr = c(foo$upr)) %>%
    bind_rows(train_frac_bin)
}

plot_cat_train_test <- function(col){
  col <- enquo(col)
  combine %>%
    group_by(!!col, dset) %>%
    count() %>%
    spread(dset, n, fill = 0) %>%
    mutate(frac_train = train/(train+test)*100,
           lwr = get_binCI(train,(train+test))[[1]]*100,
           upr = get_binCI(train,(train+test))[[2]]*100,
           col = !!col
           ) %>%
    ggplot(aes(col, frac_train)) +
    geom_point() +
    geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7) +
    labs(x = as.character(col)[2], y = "")
}

We begin with a comparison of the train vs test data in two different aspects: First the distribution of id values and second a closer look at the binary features in which we plot the relative fraction of train observations for each level:

p1 <- combine %>%
  ggplot(aes(id, fill = dset)) +
  geom_density(bw = 1, alpha = 0.5) +
  coord_cartesian(ylim = c(7.3e-4,8.3e-4))

p2 <- train_frac_bin %>%
  ggplot(aes(name, tfrac, color = value)) +
  geom_point() +
  geom_errorbar(aes(ymin = lwr, ymax = upr, color = value), width = 0.5, size = 0.7) +
  labs(x = "Binary 'ind' features", y = "Training set percentage") +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 27


p1 <- 1; p2 <- 1

We find:

  • The id values appear to be pretty randomly distributed. There is no obvious id range where we find predominantly train or test data. We’re using a density plot here to compare the relative fluctuations.

  • The binary feature levels are remarkably similar in terms of their train/(train+test) percentage. We see that on average the train data makes up about 40% of each feature and for each TRUE & FALSE level, respectively. Levels with larger deviation have only a few objects and therefore large errors. Everything is consistent within the uncertainties.

Next, we will have a look at the levels of the categorical features. Our approach will be the same as for the binary features, in that we compute the train/(train+test) percentage for each level and then plot everything on a comprehensive plot. This figure is rather busy and the important result is in the overall picture rather than the details:

#str_c("p",seq(1,length(cat_col))," <- plot_cat_train_test(",cat_col)

p1 <- plot_cat_train_test(ps_ind_02_cat) 
p2 <- plot_cat_train_test(ps_ind_04_cat)
p3 <- plot_cat_train_test(ps_ind_05_cat) 
p4 <- plot_cat_train_test(ps_car_01_cat)
p5 <- plot_cat_train_test(ps_car_02_cat)
p6 <- plot_cat_train_test(ps_car_03_cat)
p7 <- plot_cat_train_test(ps_car_04_cat)
p8 <- plot_cat_train_test(ps_car_05_cat)
p9 <- plot_cat_train_test(ps_car_06_cat)
p10 <- plot_cat_train_test(ps_car_07_cat)
p11 <- plot_cat_train_test(ps_car_08_cat)
p12 <- plot_cat_train_test(ps_car_09_cat)
p13 <- plot_cat_train_test(ps_car_10_cat)
p14 <- plot_cat_train_test(ps_car_11_cat)

layout <- matrix(seq(1,14),2,7,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, layout=layout)
Fig. 28


p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1
p11 <- 1; p12 <- 1; p13 <- 1; p14 <- 1

We find:

  • Once more, practically all levels appear to be consistent with a 40% train percentage, which is even more remarkable than for the binary features.

  • There are a few tentative outliers, e.g. in ps_ind_05_cat, but for so many levels that’s almost expected for the 95% confidence uncertainties that we are plotting.

Finally, we will also sample some of the floating-point features. Here we use overlapping density plots for the train vs test observations:

p1 <- combine %>%
  ggplot(aes(ps_reg_01, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p2 <- combine %>%
  ggplot(aes(ps_reg_02, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p3 <- combine %>%
  ggplot(aes(ps_reg_03, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p4 <- combine %>%
  ggplot(aes(ps_calc_01, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p5 <- combine %>%
  ggplot(aes(ps_calc_02, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05) +
  theme(legend.position = "none")

p6 <- combine %>%
  ggplot(aes(ps_calc_03, fill = dset)) +
  geom_density(alpha = 0.5, bw = 0.05)

layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 29


p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1

We find that for these features the overlap appears to be practically perfect. We’re using two different colours, with an alpha-mixing parameter of 0.5, but all we see is the blended overlap.

In summary, we find the train vs test data sets to be remarkably similar in the distributions of their features. The train/test split appears to be well executed and stratified by feature levels.

Feature selection, evaluation metric, and validation split

Not all features in our data set will be equally important. In particular, we have seen that the calc features appear to be of limited usefulness. Here we only include meaningful variables. The following code block, adapted from my recent Taxi competition kernel, is designed to be easily modified for a new challenge:

# Feature formatting
combine <- combine %>%
  mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
  mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
  mutate(target = as.integer(levels(target))[target])
# Specific definitions:
#---------------------------------
# predictor features
ind_cols <- c("ps_ind_01","ps_ind_02_cat","ps_ind_03","ps_ind_04_cat","ps_ind_05_cat","ps_ind_06_bin","ps_ind_07_bin","ps_ind_08_bin","ps_ind_09_bin","ps_ind_10_bin","ps_ind_11_bin","ps_ind_12_bin","ps_ind_13_bin","ps_ind_14","ps_ind_15","ps_ind_16_bin","ps_ind_17_bin","ps_ind_18_bin")
reg_cols <- c("ps_reg_01","ps_reg_02","ps_reg_03")
car_cols <- c("ps_car_01_cat","ps_car_02_cat","ps_car_03_cat","ps_car_04_cat","ps_car_05_cat","ps_car_06_cat","ps_car_07_cat","ps_car_08_cat","ps_car_09_cat","ps_car_10_cat","ps_car_11_cat","ps_car_11","ps_car_12","ps_car_13","ps_car_14","ps_car_15")
calc_cols <- c("ps_calc_01","ps_calc_02","ps_calc_03","ps_calc_04","ps_calc_05","ps_calc_06","ps_calc_07","ps_calc_08","ps_calc_09","ps_calc_10","ps_calc_11","ps_calc_12","ps_calc_13","ps_calc_14","ps_calc_15_bin","ps_calc_16_bin","ps_calc_17_bin","ps_calc_18_bin","ps_calc_19_bin","ps_calc_20_bin")
eng_cols <- c("nano", "bin_ind", "bin_calc", "diff_ind", "diff_calc")

train_cols <- c(ind_cols, reg_cols, car_cols, eng_cols)

# target feature
y_col <- c("target")
# identification feature
id_col <- c("id") 
# auxiliary features
aux_cols <- c("dset")
#---------------------------------

# General extraction
#---------------------------------
# extract test id column
test_id <- combine %>%
  filter(dset == "test") %>%
  select(!!sym(id_col))

# all relevant columns for train/test
cols <- c(train_cols, y_col, aux_cols)
combine <- combine %>%
  select_(.dots = cols)


# split train/test
train <- combine %>%
  filter(dset == "train") %>%
  select_(.dots = str_c("-",c(aux_cols)))
test <- combine %>%
  filter(dset == "test") %>%
  select_(.dots = str_c("-",c(aux_cols, y_col)))
#---------------------------------

The model evaluation metric here is the Normalised Gini coefficient. It has the advantage that only the relative order of the predicted probabilities is measured, not their absolute values. Thereby, your prediction is successful as long as it assigns low probabilities to true “no claim” observations and high probabilities to true “claims”. The “normalised” property means that the Gini score is divided by the score of a perfect model, so that the best possible value is 1. The metric orders your predictions by probability, and the more “claims” (or “no claims”) are among the high (or low) probabilities the better the result.
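To make the ordering-only property of the Normalised Gini metric concrete, here is a bare-bones illustration (a sketch for intuition, not the competition's official implementation):

```r
# Minimal normalised Gini: only the ordering of the predictions matters
normalized_gini <- function(actual, preds) {
  gini <- function(a, p) {
    a <- a[order(p, decreasing = TRUE)]       # sort actuals by prediction
    cum_pos <- cumsum(a) / sum(a)             # cumulative share of true claims
    sum(cum_pos - seq_along(a) / length(a)) / length(a)
  }
  gini(actual, preds) / gini(actual, actual)  # divide by the perfect score
}

actual <- c(1, 0, 1, 0, 0)
normalized_gini(actual, c(0.9, 0.1, 0.8, 0.2, 0.3))  # perfect ordering:  1
normalized_gini(actual, c(0.2, 0.9, 0.1, 0.8, 0.3))  # reversed ordering: -1
```

Note that scaling or shifting the predictions leaves the score unchanged, since only their rank order enters the computation.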

For a more extensive and intuitive explanation of the Gini metric I recommend this great kernel by kilian. There are numerous kernels already that deal with (efficiently) implementing this metric. We borrow our definition from this great work by Kevin (many thanks!):

xgb_normalizedgini <- function(preds, dtrain){
  actual <- getinfo(dtrain, "label")
  score <- NormalizedGini(preds, actual) # NormalizedGini() from the 'MLmetrics' package
  return(list(metric = "NormalizedGini", value = score))
}

In order to assess how well our model generalises we will perform a cross-validation step and also split our training data into a train vs validation data set. Thereby, the model performance can be evaluated on a data sample that the algorithm has not seen. We split our data into 80/20 fractions:

library('caret') # for createDataPartition

set.seed(4321)
trainIndex <- createDataPartition(train$target, p = 0.8, list = FALSE, times = 1)

valid <- train[-trainIndex,] # extract the validation rows first ...
train <- train[trainIndex,]  # ... before overwriting 'train'

Normally, we might want to include a cleaning step where we remove wildly unlikely data points (created by typos or other mistakes) that don’t teach us anything about the problem at hand. Here, however, the data sets appear to be very well behaved and no cleaning is necessary.
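For illustration only, such a cleaning step could look like this (a hypothetical toy example; nothing is removed from our actual data):

```r
set.seed(1)
# One typo-like outlier among otherwise well-behaved values
toy <- data.frame(x = c(rnorm(99), 1e6))

# Keep only rows within 10 robust deviations of the median
cleaned <- subset(toy, abs(x - median(x)) < 10 * mad(x))
nrow(cleaned)  # 99: the extreme row is gone
```

Using the median and MAD instead of the mean and standard deviation keeps the threshold itself from being distorted by the very outliers we want to remove.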

XGBoost parameters and fitting

We fit our model using XGBoost, because it is a robust and easy-to-use tool. In essence, it is a decision-tree gradient boosting algorithm with a high degree of popularity here on Kaggle. The parameters used here are not very sophisticated but should give you an idea of how the algorithm works. I recommend spending some time optimising the parameters and also trying out the different tools presented in other kernels.

A few brief words about gradient boosting, as far as I understand it: Boosting is what we call the step-by-step improvement of a weak learner (like a relatively shallow decision tree of max_depth levels) by successively applying it to the results of the previous learning step (for nrounds times in total). Gradient boosting focusses on minimising the loss function (according to our evaluation metric) by training the algorithm on the gradient of this function. The method of gradient descent iteratively moves in the direction of the steepest descent (i.e. the most negative first derivative) of the loss function. The step size can vary from iteration to iteration and has a multiplicative shrinkage factor eta in (0,1] associated with it for additional tuning. Smaller values of eta result in a slower descent and require a larger nrounds.
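The role of eta can be illustrated with a toy gradient descent on a simple one-parameter quadratic loss (an analogy only; XGBoost's actual updates are tree-based):

```r
# Toy gradient descent on the loss (w - 3)^2, whose minimum is at w = 3;
# 'eta' plays the role of the shrinkage factor.
loss_grad <- function(w) 2 * (w - 3)     # first derivative of (w - 3)^2

descend <- function(eta, nrounds) {
  w <- 0                                 # start away from the minimum
  for (i in seq_len(nrounds)) {
    w <- w - eta * loss_grad(w)          # step against the gradient
  }
  w
}

descend(eta = 0.3,  nrounds = 10)  # already very close to 3
descend(eta = 0.05, nrounds = 10)  # smaller eta: slower, needs more rounds
```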

In order for XGBoost to properly ingest our data samples we need to re-format them slightly:

#convert to XGB matrix
foo <- train %>% select(-target)
bar <- valid %>% select(-target)

dtrain <- xgb.DMatrix(as.matrix(foo),label = train$target)
dvalid <- xgb.DMatrix(as.matrix(bar),label = valid$target)
dtest <- xgb.DMatrix(as.matrix(test))

Now we define the meta-parameters that govern how XGBoost operates:

xgb_params <- list(colsample_bytree = 0.7, #variables per tree 
                   subsample = 0.7, #data subset per tree 
                   booster = "gbtree",
                   max_depth = 5, #tree levels
                   eta = 0.3, #shrinkage
                   eval_metric = xgb_normalizedgini,
                   objective = "reg:logistic",
                   seed = 4321,
                   nthread = -1
                   )

watchlist <- list(train=dtrain, valid=dvalid)

We first run a cross-validation (CV) step to measure our model’s performance and to determine how many iteration steps are optimal. To keep this stage short, within the limits of this kernel, I’m only running 50 iterations here (nrounds parameter). Ideally, you would want to use more.

We are using a 5-fold CV, which essentially means that we split our data in 5 parts and train on any 4 of them while evaluating on the remaining one. This is done iteratively, so that every fold becomes the validation fold. This is an efficient way to measure the performance of your fit on data it has not seen.
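Schematically, the fold assignment in a 5-fold CV works like this (a simple random-assignment sketch; xgb.cv handles all of this internally):

```r
set.seed(1234)
n <- 20                                    # toy number of observations
fold <- sample(rep(1:5, length.out = n))   # each row lands in exactly one fold

for (k in 1:5) {
  train_idx <- which(fold != k)            # fit on four folds ...
  valid_idx <- which(fold == k)            # ... evaluate on the held-out one
}
table(fold)                                # n/5 = 4 observations per fold
```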

The parts below here are currently not running in the Kernel because of memory problems. I’m investigating. Sorry for that.

set.seed(1234)
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 5, nfold = 5, nrounds=50, maximize = TRUE)

The number of the best iteration is now stored in a variable we can access for training our classifier for the optimal number of rounds. Here, and above, we also set a random seed to ensure reproducibility:

set.seed(4321)
gb_dt <- xgb.train(params = xgb_params,
                   data = dtrain,
                   print_every_n = 5,
                   watchlist = watchlist,
                   nrounds = xgb_cv$best_iteration)

Be aware that this is not a competitive model. Other (engineered) features can be included, the meta-parameters can be optimised, and a larger amount of training steps can be run. In addition, several kernels have already shown the power of ensembling different individual models into a combined prediction that shows better results.

What this basic model can do is to serve as an example on how to construct your own prediction and give you a place to start from. Make sure to check out other modelling kernels as well to get a broader overview.

Feature importance

After training we will check which features are the most important for our model. This can provide the starting point for an iterative process where we identify, step by step, the significant features for a model. Here we will simply visualise these features:

imp_matrix <- as_tibble(xgb.importance(feature_names = colnames(train %>% select(-target)), model = gb_dt))

imp_matrix %>%
  ggplot(aes(reorder(Feature, Gain, FUN = max), Gain, fill = Feature)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "Features", y = "Importance")

We find:

  • As pointed out in several other modelling kernels, the features ps_car_13 and ps_reg_03 are the most important ones for our tentative prediction model. Also note the two ps_ind features on 3rd and 4th rank.

  • In addition, our engineered features are not doing bad either. diff_ind is ranked the 5th most significant feature, and nano still has some impact, too. diff_calc, bin_calc, and bin_ind are towards the end of the list but (not only) in this competition small improvements can have a large impact on the resulting leaderboard score.

  • Those results are preliminary, since the model performance can certainly be optimised, but they already show that feature engineering is a promising way of improving your prediction.

Prediction and submission file

Once we are reasonably happy with our CV score we use the corresponding model to make a prediction for the test data, write the submission file, and submit it to the Kaggle leaderboard. This gives a performance score based on an independent data set. For this very reason, it is advisable to keep the number of leaderboard submissions low. Otherwise you run the risk of having your model influenced too much by the public leaderboard data, which will result in overfitting. It’s better to trust your local CV.

Practically, this penultimate code block will apply the XGBoost model to the test data and write a submission file according to the competition specification. You will then find the file submit.csv in the “Output” tab of the Kaggle kernel or, if you have run the script locally, in the source directory.

pred <- test_id %>%
  mutate(target = predict(gb_dt, dtest))

pred %>% write_csv('submit.csv')

All that’s left to do is to double-check this submission file to make sure that we are not wasting a (potentially) valuable submission with a mal-formatted output file. We are comparing its shape and content with the sample submission file that we read in at the very beginning, and also plot the distribution of the predicted values compared to the training values for an approximate comparison:

identical(dim(sample_submit),dim(pred))
glimpse(sample_submit)
glimpse(pred)
bind_cols(sample_submit, pred) %>% head(5)

Prediction file format successfully verified.

And that’s it! Many thanks for reading this kernel! I hope that it will benefit your own analysis and I wish you the best of success.


Have fun!